Summary

This notebook is an experiment that uses the VADER algorithm to perform sentiment analysis on tweets and to cluster them based on sentiment.

VADER is a lexicon-based sentiment analyzer designed for social media text from sites such as Twitter and Facebook. It assigns each sentence a compound sentiment score in the range [-1, 1], where -1 is very negative, 0 is neutral and 1 is very positive. VADER is also valence aware: it detects words that shift the polarity of a sentence even though the sentiment-bearing words stay the same. For example, “I like pizza” conveys positive sentiment through the word “like”, while “I don’t like pizza” conveys negative sentiment through the addition of “don’t”, even though the positive word “like” is still present.
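
To make the negation example concrete, here is a toy sketch in base R. It is not VADER itself: the one-word lexicon, the negation list, and the helper name `toy_compound` are hypothetical. The valence of 1.5 for “like”, the negation factor of -0.74 and the normalization constant 15 are assumptions, chosen to mirror how a lexicon score can be flipped by a preceding negation and then squashed into (-1, 1).

```r
# Toy lexicon with a single entry (illustration only, not the VADER lexicon)
toy_lexicon <- c(like = 1.5)
negations   <- c("don't", "not", "never")
neg_scalar  <- -0.74  # assumed factor that flips and dampens a negated word
alpha       <- 15     # assumed normalization constant

toy_compound <- function(sentence) {
  words <- strsplit(tolower(sentence), "\\s+")[[1]]
  total <- 0
  for (j in seq_along(words)) {
    valence <- toy_lexicon[words[j]]
    if (is.na(valence)) next  # word carries no sentiment in the toy lexicon
    # flip the valence when the previous word is a negation
    if (j > 1 && words[j - 1] %in% negations) valence <- valence * neg_scalar
    total <- total + valence
  }
  # squash the raw sum into the open interval (-1, 1)
  unname(total / sqrt(total^2 + alpha))
}

toy_compound("I like pizza")        # positive score
toy_compound("I don't like pizza")  # negative score
```

The word “like” contributes in both sentences. Only the preceding “don't” changes the sign of the result.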

We query 10,000 tweets from the COVID Events Dataset with the text filter "cure prevent" in order to get tweets somewhat related to COVID-19 cures and prevention methods. We then use VADER to compute the sentiment score of each tweet and cluster the tweets by their embeddings using k-means. Afterwards, we take each cluster separately, multiply each tweet’s embedding vector by its sentiment score and perform subclustering on these rescaled embeddings.

While the meaning, subtopic or discussion behind each subcluster is still unclear, we note that the sentiment-scaled embedding clusters are clearly distinct when visualized in 2D using t-SNE. The R implementation of VADER is simple to use and works well with dataframes without extra preprocessing. It could therefore be worthwhile to run it on all available tweets in the backend in order to simplify future temporal and spatial analysis.

Configure the search parameters here - set date range and semantic phrase:

THIS SECTION IS IDENTICAL TO THE ORIGINAL NOTEBOOK

Note: large date ranges can take some time to process on initial search due to the sheer volume of data we have collected. Subsequent searches using the same date range should run quickly due to Elasticsearch caching.

# query start date/time (inclusive)
rangestart <- "2020-01-01 00:00:00"

# query end date/time (exclusive)
rangeend <- "2020-08-01 00:00:00"

# text filter restricts results to only those containing words, phrases, or meeting a boolean condition. This query syntax is very flexible and supports a wide variety of filter scenarios:
# words: text_filter <- "cdc nih who"  ...contains "cdc" or "nih" or "who"
# phrase: text_filter <- '"vitamin c"' ...contains exact phrase "vitamin c"
# boolean condition: text_filter <- '(cdc nih who) +"vitamin c"' ...contains ("cdc" or "nih" or "who") and exact phrase "vitamin c"
# full specification here: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-simple-query-string-query.html
text_filter <- "cure prevent"

# query semantic similarity phrase (choose one of these examples or enter your own)
#semantic_phrase <- "Elementary school students are not coping well with distance learning."
#semantic_phrase <- "How do you stay at home when you are homeless?"
#semantic_phrase <- "My wedding has been postponed due to the coronavirus."
#semantic_phrase <- "I lost my job because of COVID-19. How am I going to be able to make rent?"
#semantic_phrase <- "I am diabetic and out of work because of coronavirus. I am worried I won't be able to get insulin without insurance."
#semantic_phrase <- "There is going to be a COVID-19 baby boom..."
#semantic_phrase <- "Vitamin"
semantic_phrase <- ""

# return results in chronological order or as a random sample within the range
# (ignored if semantic_phrase is not blank)
random_sample <- FALSE
# number of results to return (max 10,000)
resultsize <- 10000

####TEMPORARY SETTINGS####
# number of subclusters per high level cluster (temporary until automatic selection implemented)
cluster.k <- 3
# show/hide extra info (temporary until tabs are implemented)
show_original_subcluster_plots <- FALSE
show_regrouped_subcluster_plots <- TRUE
show_word_freqs <- FALSE
show_center_nn <- FALSE

Using VADER and Clustering

Next, we add a new column to the dataframe of tweets consisting of the VADER compound sentiment score.

####################################################
# Compute and attach tweet sentiment to each tweet
####################################################

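library(vader) # provides vader_df(); harmless if already attached earlier in the notebook

# vader_df() returns one row per input text; keep only the compound score
tweet.vectors.df$sentiment <- vader_df(tweet.vectors.df$full_text)[, "compound"]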

We then cluster the tweet embeddings using k-means with 4 clusters. This number of clusters was chosen using the NbClust function from the NbClust package, which computes several metrics over a given range of cluster counts and returns the best number by majority rule. It runs very slowly when computing all metrics, but as Dr. Erickson mentioned, there might be a way to parallelize this computation, which would be worthwhile in the context of an app.

###############################################################################
# Run K-means on all the tweet embedding vectors
###############################################################################

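# Run NbClust to find an optimal number of clusters, takes a while.
# NbClust() returns a list, so k is read off the majority-rule partition.
#nb <- NbClust(data = tweet.vectors.matrix, min.nc = 2, max.nc = 20, method = "kmeans")
#k <- length(unique(nb$Best.partition))
k <- 4 # Use 4 or any other integer for quick trials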

set.seed(300)
km <- kmeans(tweet.vectors.matrix, centers=k, iter.max=30)

tweet.vectors.df$vector_type <- factor("tweet", levels=c("tweet", "cluster_center", "subcluster_center"))
tweet.vectors.df$cluster <- as.factor(km$cluster)

#append cluster centers to dataset for visualization
centers.df <- data.frame(full_text=paste("Cluster (", rownames(km$centers), ") Center", sep=""),
                         user_screen_name="[N/A]",
                         user_verified="[N/A]",
                         user_location="[N/A]",
                         user_location_type = "[N/A]",
                         class = "[N/A]",
                         is_specific_event = "[N/A]",
                         opinion = "[N/A]",
                         vector_type = "cluster_center",
                         cluster=as.factor(rownames(km$centers)),
                         sentiment=0.0)
tweet.vectors.df <- rbind(tweet.vectors.df, centers.df)
tweet.vectors.matrix <- rbind(tweet.vectors.matrix, km$centers)
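
For quick trials, a cheaper way to sanity-check the choice of k is an elbow curve built from the total within-cluster sum of squares that `kmeans` already reports. The sketch below is only an illustration on synthetic data: `toy.matrix`, `k.range` and `wss` are hypothetical names, and the three Gaussian blobs stand in for the real tweet embedding matrix.

```r
set.seed(300)

# Synthetic stand-in for the embedding matrix: three well-separated 2-D blobs
toy.matrix <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 6, sd = 0.3), ncol = 2)
)

# Total within-cluster sum of squares for each candidate number of clusters
k.range <- 1:6
wss <- sapply(k.range, function(k) {
  kmeans(toy.matrix, centers = k, iter.max = 30, nstart = 5)$tot.withinss
})

print(round(wss, 1))
# plot(k.range, wss, type = "b")  # uncomment to draw the elbow curve
```

The value of k after which `wss` stops dropping sharply is the usual elbow candidate. This is a heuristic and does not replace the majority vote over many indices that NbClust performs.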

Afterwards, we consider each cluster, multiply each embedding by its tweet’s sentiment score and perform subclustering on these scaled vectors using k-means. Here, we use 3 subclusters, intended to capture tweets with positive, neutral and negative sentiment. For the sake of time, this choice was not based on elbow plots or NbClust. However, once cluster number selection is automated, it would be natural to apply it here too.

###########################################################
# Obtain compound sentiment of clustered tweets using VADER
# Then perform subclustering based on tweet vectors scaled
# by sentiment score
###########################################################
tweet.vectors.df$subcluster <- c(0)

for (i in 1:k) {
  set.seed(500)
  cluster.tweets <- tweet.vectors.df[tweet.vectors.df$cluster == i,]
  cluster.matrix <- tweet.vectors.matrix[tweet.vectors.df$cluster == i,]
  cluster.matrix.sentiment.applied <- sweep(cluster.matrix, MARGIN = 1, cluster.tweets$sentiment, `*`)
  cluster.km <- kmeans(cluster.matrix.sentiment.applied, centers=cluster.k, iter.max=30)
  tweet.vectors.df[tweet.vectors.df$cluster == i, "subcluster"] <- cluster.km$cluster
 
  #append subcluster centers to dataset for visualization
  centers.df <- data.frame(full_text=paste("Subcluster (", rownames(cluster.km$centers), ") Center", sep=""),
                           user_screen_name="[N/A]",
                           user_verified="[N/A]",
                           user_location="[N/A]",
                           user_location_type = "[N/A]",
                           class = "[N/A]",
                           is_specific_event = "[N/A]",
                           opinion = "[N/A]",
                           vector_type = "subcluster_center",
                           cluster=as.factor(i),
                           sentiment=0.0,
                           subcluster=rownames(cluster.km$centers))
  tweet.vectors.df <- rbind(tweet.vectors.df, centers.df)
  tweet.vectors.matrix <- rbind(tweet.vectors.matrix, cluster.km$centers)
}
tweet.vectors.df$subcluster <- as.factor(tweet.vectors.df$subcluster)
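
The row-wise rescaling done by `sweep` in the loop above can be seen on a tiny example. The matrix `emb` and the scores in `sentiment` are made-up values for illustration, not data from the notebook.

```r
# Three toy "embeddings" (one per row) with hypothetical sentiment scores
emb <- matrix(c(1, 2,
                3, 4,
                5, 6), nrow = 3, byrow = TRUE)
sentiment <- c(0.5, 0, -1)

# MARGIN = 1 pairs one sentiment value with each row
scaled <- sweep(emb, MARGIN = 1, STATS = sentiment, FUN = `*`)
print(scaled)
```

A positive score shrinks or keeps the vector, a negative score mirrors it through the origin, and a score of exactly 0 collapses the row to the zero vector. Any row with sentiment 0, including the appended cluster-center rows, therefore lands on the same point before subclustering.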

We now visualize the clusters and subclusters using t-SNE. The cluster plot shows the original embeddings, while the subcluster plots show the rescaled embeddings. We can clearly see how each subcluster is well defined.

## [1] "Plotting cluster 1 ..."
## [1] "Plotting cluster 2 ..."
## [1] "Plotting cluster 3 ..."
## [1] "Plotting cluster 4 ..."

While this experiment shows some potential in using sentiment in clustering, there are still many questions left unanswered:

  • What’s the sentiment distribution in each cluster? Is it unimodal or bimodal?
  • We used sentiment in subclustering, but could we use it before clustering or after subclustering?
  • What does each subcluster mean? Are the subclusters different subtopics with their own sentiment profile or are they different takes on the cluster topic?
  • All the temporal, spatial and reactionary questions are still open.
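
A first look at the sentiment distribution question could start from per-cluster summaries like the sketch below. It runs on a hypothetical stand-in data frame (`toy.df`) with random scores rather than the real `tweet.vectors.df`, and the cutoffs of -0.05 and 0.05 for negative, neutral and positive are an assumed convention for the compound score.

```r
set.seed(1)

# Hypothetical stand-in for tweet.vectors.df with only the two relevant columns
toy.df <- data.frame(
  cluster   = factor(rep(1:4, each = 25)),
  sentiment = runif(100, min = -1, max = 1)
)

# Quartiles of the compound score within each cluster
sentiment.summary <- tapply(toy.df$sentiment, toy.df$cluster, quantile,
                            probs = c(0.25, 0.5, 0.75))

# Share of negative / neutral / positive tweets per cluster (assumed cutoffs)
sentiment.class <- cut(toy.df$sentiment,
                       breaks = c(-1, -0.05, 0.05, 1),
                       labels = c("negative", "neutral", "positive"),
                       include.lowest = TRUE)
class.share <- prop.table(table(toy.df$cluster, sentiment.class), margin = 1)

print(sentiment.summary)
print(round(class.share, 2))
```

On the real data, the appended center rows (those whose `vector_type` is not "tweet") would need to be filtered out first, since they carry a placeholder sentiment of 0.0. A histogram per cluster would then show whether the distribution is unimodal or bimodal.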